feat: pass shard_id to eloqstore#463

Open
thweetkomputer wants to merge 6 commits into main from feat-shard-id-zc

Conversation

@thweetkomputer
Collaborator

@thweetkomputer thweetkomputer commented Mar 18, 2026

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Reference the link of issue using fixes eloqdb/tx_service#issue_id
  • Reference the link of RFC if exists
  • Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Summary by CodeRabbit

  • Refactor
    • Database startup now accepts shard identification to support multi-shard deployments.
  • New Features
    • Configurable global request batch size added with a runtime flag and config option to tune throughput.
  • Bug Fixes
    • Added validation to reject outdated or mismatched standby checkpoint updates to avoid processing stale requests.
  • Chores
    • Updated an internal submodule and cleaned up a log message format for clarity.

@coderabbitai

coderabbitai bot commented Mar 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

StartDB now takes a uint32_t shard_id parameter in addition to term. The DataStore interface declares it without a default; the EloqStoreDataStore override names it data_shard_id and gives it a default of 0. Call sites were updated to pass data_shard_id. A new eloq_store_max_global_request_batch flag and its initialization logic were added. The eloqstore submodule pointer was advanced. A term guard was added to UpdateStandbyCkptTs in tx_service.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Interface & Impl | store_handler/eloq_data_store_service/data_store.h, store_handler/eloq_data_store_service/eloq_store_data_store.h | Changed StartDB(int64_t term) → StartDB(int64_t term, uint32_t shard_id) in base interface; EloqStoreDataStore now StartDB(int64_t term, uint32_t data_shard_id = 0) and forwards shard id to eloq start. |
| Call Site | store_handler/eloq_data_store_service/data_store_service.cpp | Updated calls to StartDB to pass the new data_shard_id argument (e.g., StartDB(term, data_shard_id)). |
| Config | store_handler/eloq_data_store_service/eloq_store_config.cpp | Added DEFINE_uint32(eloq_store_max_global_request_batch, 1000, ...) and initialize eloqstore_configs_.max_global_request_batch from flag or INI following existing precedence rules. |
| Submodule | store_handler/eloq_data_store_service/eloqstore | Updated eloqstore submodule commit pointer; no API changes in this diff. |
| tx_service change | tx_service/src/remote/cc_node_service.cpp | Added early validation of standby node term in UpdateStandbyCkptTs to discard mismatched/outdated requests; minor log formatting change in RequestSyncSnapshot. |
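
To make the interface change summarized above concrete, here is a minimal sketch of a two-argument StartDB on a base interface and one implementation that forwards the shard id. It is an illustration only: `FakeEloqStoreService` and its `Start` method are hypothetical stand-ins and do not reflect the real eloqstore API, and the real classes carry many more members.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for the DataStore interface described in the walkthrough.
class DataStore
{
public:
    virtual ~DataStore() = default;
    // New shape: the shard id travels with the term.
    virtual bool StartDB(int64_t term, uint32_t shard_id) = 0;
};

// Hypothetical service that only records what it was started with.
struct FakeEloqStoreService
{
    int64_t started_term = -1;
    uint32_t started_shard = 0;

    bool Start(int64_t term, uint32_t shard_id)
    {
        started_term = term;
        started_shard = shard_id;
        return true;
    }
};

// Implementation that forwards both arguments to the underlying service.
class EloqStoreDataStore : public DataStore
{
public:
    explicit EloqStoreDataStore(FakeEloqStoreService *svc) : svc_(svc)
    {
        assert(svc != nullptr);
    }

    bool StartDB(int64_t term, uint32_t data_shard_id) override
    {
        return svc_->Start(term, data_shard_id);
    }

private:
    FakeEloqStoreService *svc_;
};
```

A caller holding a `DataStore *` would then write `store->StartDB(term, data_shard_id)`, which matches the updated call site listed in the table.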

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • lzxddz
  • liunyl

Poem

"🐰 I hopped through headers, small and spry,
I nudged a shard ID, then waved goodbye.
A flag I planted, neat and bright,
Submodule moved — the build takes flight.
Tiny hops, big tail-wag delight 🥕"

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description only contains repository reminders as a checklist but lacks substantive details about the changes, their rationale, impact, testing status, or issue/RFC references despite the template requiring these elements. | Provide a meaningful description explaining the feature purpose, list which files were modified, confirm testing status, and add issue/RFC references using the 'fixes' syntax where applicable. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat: pass shard_id to eloqstore' accurately and clearly summarizes the primary change: adding shard_id parameter passing throughout the DataStore interface and implementations. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@store_handler/eloq_data_store_service/data_store_service.cpp`:
- Line 428: The implementations RocksDBDataStore::StartDB(int64_t term) and
RocksDBCloudDataStore::StartDB(int64_t term) have the wrong signature and must
be changed to match the base class virtual bool StartDB(int64_t term, uint32_t
shard_id); update both the method declarations and definitions to bool
RocksDBDataStore::StartDB(int64_t term, uint32_t shard_id) and bool
RocksDBCloudDataStore::StartDB(int64_t term, uint32_t shard_id), and adjust any
internal uses of the missing shard_id parameter; ensure the corresponding
factory code in rocksdb_data_store_factory.h and
rocksdb_cloud_data_store_factory.h and any callers (e.g., the call
shard_ref.data_store_->StartDB(term, data_shard_id)) compile against this
corrected signature.

In `@store_handler/eloq_data_store_service/data_store.h`:
- Line 65: The StartDB signature changed to StartDB(int64_t term, uint32_t
shard_id); update both the declaration and definition for
RocksDBDataStore::StartDB and RocksDBCloudDataStore::StartDB to match that exact
signature (include the uint32_t shard_id parameter) and keep the override
specifier; modify the corresponding .cpp implementations to accept and use the
shard_id parameter and adjust any internal calls or forwarded calls to StartDB
to pass the shard_id as well so the override contract and compilation succeed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c484af00-2d3e-4b5a-b248-7845534a8301

📥 Commits

Reviewing files that changed from the base of the PR and between 1a4729b and 9742c8e.

📒 Files selected for processing (4)
  • store_handler/eloq_data_store_service/data_store.h
  • store_handler/eloq_data_store_service/data_store_service.cpp
  • store_handler/eloq_data_store_service/eloq_store_data_store.h
  • store_handler/eloq_data_store_service/eloqstore

}

-res = shard_ref.data_store_->StartDB(term);
+res = shard_ref.data_store_->StartDB(term, data_shard_id);

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Find all StartDB call sites to verify they pass the shard_id parameter
# Expect: All call sites should pass two parameters (term and shard_id), or the factories should be excluded from this PR intentionally

echo "=== All StartDB call sites ==="
rg -n --type=cpp 'StartDB\s*\(' -A2 -B2

echo ""
echo "=== Calls with only one parameter (potential issues) ==="
rg -n --type=cpp '->StartDB\s*\([^,)]+\)\s*;'

Repository: eloqdata/tx_service

Length of output: 7594


🏁 Script executed:

# Check the RocksDBDataStore::StartDB implementation
echo "=== RocksDBDataStore::StartDB implementation ==="
sed -n '85,110p' store_handler/eloq_data_store_service/rocksdb_data_store.cpp

echo ""
echo "=== RocksDBCloudDataStore::StartDB implementation ==="
sed -n '263,290p' store_handler/eloq_data_store_service/rocksdb_cloud_data_store.cpp

echo ""
echo "=== Check base class interface ==="
sed -n '60,70p' store_handler/eloq_data_store_service/data_store.h

Repository: eloqdata/tx_service

Length of output: 2277


Update RocksDBDataStore and RocksDBCloudDataStore signatures to match the base interface.

The change at line 428 correctly passes shard_id to StartDB, but two critical implementations have incorrect signatures that violate the base class contract:

  • RocksDBDataStore::StartDB(int64_t term) - should accept uint32_t shard_id parameter
  • RocksDBCloudDataStore::StartDB(int64_t term) - should accept uint32_t shard_id parameter

The base class interface defines: virtual bool StartDB(int64_t term, uint32_t shard_id) = 0;

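Both subclasses still declare the one-argument form with override, which no longer matches any virtual function in the base class. The translation units behind rocksdb_data_store_factory.h and rocksdb_cloud_data_store_factory.h will therefore fail to compile, before the call on line 428 is ever reached at run time.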
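
The mismatch can be reproduced in isolation. The sketch below uses hypothetical class names (`StaleStore`, `FixedStore`) rather than the real RocksDB-backed classes. A one-argument StartDB hides the base function instead of overriding it, so the class stays abstract; adding `override` to that declaration would turn the mismatch into a hard compile error, which is the failure the review describes.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

class DataStore
{
public:
    virtual ~DataStore() = default;
    virtual bool StartDB(int64_t term, uint32_t shard_id) = 0;
};

// Old signature, deliberately written without `override` so the file compiles.
// It hides DataStore::StartDB and leaves the pure virtual unimplemented.
class StaleStore : public DataStore
{
public:
    bool StartDB(int64_t term) { return term >= 0; }
};

// Corrected signature: identical parameter list, so `override` is accepted.
class FixedStore : public DataStore
{
public:
    bool StartDB(int64_t term, uint32_t shard_id) override
    {
        (void)shard_id;  // a real store would use the shard id here
        return term >= 0;
    }
};

static_assert(std::is_abstract<StaleStore>::value,
              "one-argument StartDB does not implement the base interface");
static_assert(!std::is_abstract<FixedStore>::value,
              "two-argument StartDB implements the base interface");
```

Keeping `override` on every subclass declaration is what surfaces this kind of drift at compile time rather than leaving an object that cannot be instantiated.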

* @return True if start successfully, otherwise false.
*/
-virtual bool StartDB(int64_t term) = 0;
+virtual bool StartDB(int64_t term, uint32_t shard_id) = 0;

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read-only verification: find StartDB declarations/definitions in data store service code.
rg -nP -C2 '\bStartDB\s*\(\s*int64_t\s+term\s*(,\s*uint32_t\s+\w+\s*)?\)\s*(override)?' \
  store_handler/eloq_data_store_service

# Expected result:
# - No remaining `StartDB(int64_t term) override` in subclasses of DataStore.
# - All subclass declarations/definitions aligned to `(int64_t term, uint32_t shard_id)`.

Repository: eloqdata/tx_service

Length of output: 2179


Update RocksDBDataStore and RocksDBCloudDataStore to match the new StartDB signature.

The base class signature changed to StartDB(int64_t term, uint32_t shard_id), but rocksdb_data_store.h:51 and rocksdb_cloud_data_store.h:65 still declare StartDB(int64_t term) override. The corresponding definitions in .cpp files also use the old signature. This mismatch breaks the override contract and causes compilation failure. Update both the header declarations and .cpp implementations in both files.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
store_handler/eloq_data_store_service/eloq_store_config.cpp (1)

809-815: Please add a config-precedence test for this new parameter.

This adds a new CLI-vs-INI precedence path; a focused test would prevent silent regressions in parsing behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@store_handler/eloq_data_store_service/eloq_store_config.cpp` around lines 809
- 815, Add a focused unit test that verifies the new precedence for
eloqstore_configs_.max_global_request_batch: exercise three cases—(1) CLI flag
set (simulate setting FLAGS_eloq_store_max_global_request_batch and ensure
CheckCommandLineFlagIsDefault returns false) with an INI value different from
the flag and assert the final value equals the CLI flag; (2) CLI flag left
default (simulate CheckCommandLineFlagIsDefault true) and INI provides a value
and assert the final value equals config_reader.GetInteger result; and (3)
neither CLI nor INI provide a value and assert the default
FLAGS_eloq_store_max_global_request_batch is used—use the code paths around
eloqstore_configs_.max_global_request_batch, CheckCommandLineFlagIsDefault,
FLAGS_eloq_store_max_global_request_batch, and config_reader.GetInteger to
locate where to wire the test.
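
The three precedence cases listed above can be sketched as a small pure function, which is also the easiest shape to unit-test. This is an assumption about how the logic could be factored, not the code in eloq_store_config.cpp: `ResolveMaxGlobalRequestBatch` is a hypothetical helper, the boolean stands in for the result of the CheckCommandLineFlagIsDefault check, and the optional stands in for a value read from the INI file.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper: an explicitly set CLI flag wins, otherwise the INI
// value is used, otherwise the flag's built-in default applies.
// `cli_flag_value` holds the default when `cli_flag_is_default` is true.
inline uint32_t ResolveMaxGlobalRequestBatch(bool cli_flag_is_default,
                                             uint32_t cli_flag_value,
                                             std::optional<uint32_t> ini_value)
{
    if (!cli_flag_is_default)
    {
        // Case 1: the operator passed the flag on the command line.
        return cli_flag_value;
    }
    // Case 2: INI supplies a value. Case 3: fall back to the flag default.
    return ini_value.value_or(cli_flag_value);
}
```

With the default of 1000 from the DEFINE_uint32 line, a test would cover a flag set to another value with a conflicting INI entry, a default flag with an INI entry, and a default flag with no INI entry.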

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 28fa570e-afbe-45c0-9927-72a691758b41

📥 Commits

Reviewing files that changed from the base of the PR and between 9742c8e and 560ce10.

📒 Files selected for processing (2)
  • store_handler/eloq_data_store_service/eloq_store_config.cpp
  • store_handler/eloq_data_store_service/eloqstore
✅ Files skipped from review due to trivial changes (1)
  • store_handler/eloq_data_store_service/eloqstore

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@store_handler/eloq_data_store_service/eloq_store_data_store.h`:
- Around line 180-183: The StartDB override currently declares a default value
for data_shard_id which differs from the base DataStore::StartDB signature;
remove the default argument from the derived declaration so the signature is
identical (change bool StartDB(int64_t term, uint32_t data_shard_id = 0) to bool
StartDB(int64_t term, uint32_t data_shard_id)), leaving the body that calls
eloq_store_service_->Start(term, data_shard_id) unchanged; ensure any callers
rely on the base's parameter semantics rather than the removed default.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3c3a31d0-4ccf-4dd7-8c0d-937187126a45

📥 Commits

Reviewing files that changed from the base of the PR and between 560ce10 and 917bc41.

📒 Files selected for processing (1)
  • store_handler/eloq_data_store_service/eloq_store_data_store.h

Comment on lines +180 to +183
bool StartDB(int64_t term, uint32_t data_shard_id = 0) override
{
-    ::eloqstore::KvError res = eloq_store_service_->Start(term);
+    ::eloqstore::KvError res =
+        eloq_store_service_->Start(term, data_shard_id);

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== StartDB declarations/definitions =="
rg -n --type=cpp -C2 '\bStartDB\s*\('

echo
echo "== Potential one-argument StartDB invocations (missing shard_id) =="
rg -nP --type=cpp '\bStartDB\s*\(\s*[^,()]+\s*\)' -g '!**/build/**' -g '!**/dist/**'

echo
echo "Expected:"
echo "- No remaining one-argument call sites for StartDB."
echo "- Consistent signatures between DataStore and overrides."

Repository: eloqdata/tx_service

Length of output: 8177


Remove the default argument from StartDB override to match the base class interface.

Line 180 declares data_shard_id = 0 only on the derived override, while DataStore::StartDB declares the parameter without a default. C++ resolves default arguments from the static type of the call expression, so a one-argument StartDB(term) compiles through an EloqStoreDataStore-typed object but not through a DataStore pointer or reference. That inconsistency lets a caller silently start shard 0 and weakens explicit shard propagation.

Suggested fix
-    bool StartDB(int64_t term, uint32_t data_shard_id = 0) override
+    bool StartDB(int64_t term, uint32_t data_shard_id) override
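
The static binding of default arguments is easy to check with a small detection trait. In the sketch below, `WithDefault` and `AcceptsOneArg` are hypothetical names, and StartDB returns the shard id it received purely so the effect is observable (the real function returns bool). The trait asks whether `obj.StartDB(term)` is a valid expression for a given static type.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <utility>

struct DataStore
{
    virtual ~DataStore() = default;
    // No default argument on the interface.
    virtual uint32_t StartDB(int64_t term, uint32_t shard_id) = 0;
};

struct WithDefault : DataStore
{
    // Default argument present only on the override, as in the reviewed line.
    uint32_t StartDB(int64_t /*term*/, uint32_t shard_id = 0) override
    {
        return shard_id;  // illustration only: echo the shard id received
    }
};

// True when `T` can be called as obj.StartDB(term) with a single argument.
template <typename T, typename = void>
struct AcceptsOneArg : std::false_type
{
};

template <typename T>
struct AcceptsOneArg<
    T,
    std::void_t<decltype(std::declval<T &>().StartDB(int64_t{1}))>>
    : std::true_type
{
};

static_assert(AcceptsOneArg<WithDefault>::value,
              "derived type silently accepts a missing shard id");
static_assert(!AcceptsOneArg<DataStore>::value,
              "base interface requires an explicit shard id");
```

Removing the default from the override makes both static types behave the same way, so every caller has to name the shard it intends to start.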

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tx_service/src/remote/cc_node_service.cpp (1)

1852-1895: ⚠️ Potential issue | 🟠 Major

Revalidate standby term inside the async task to avoid stale checkpoint updates.

Line 1852 validates StandbyNodeTerm() before enqueue, but the lambda at Line 1884 executes later. If standby term changes in between, this stale task can still call OnUpdateStandbyCkptTs and UpdateNodeGroupCkptTs under outdated term context. Add an execution-time term check in the lambda.

🔧 Proposed fix
-    auto standby_node_term = Sharder::Instance().StandbyNodeTerm();
+    const int64_t standby_node_term = Sharder::Instance().StandbyNodeTerm();
     if (standby_node_term == -1 ||
         (standby_node_term >> 32) != request->ng_term())
     {
@@
     const uint32_t ng_id = request->node_group_id();
     const int64_t ng_term = request->ng_term();
     const uint64_t ckpt_ts = request->primary_succ_ckpt_ts();
+    const int64_t expected_standby_term = standby_node_term;
     EnqueueStandbyTask(
         {StandbyTaskType::UpdateStandbyCkptTs,
          ng_id,
          ng_term,
          ckpt_ts,
-         [store_hd, ng_id, ng_term, ckpt_ts, has_data_store_write]()
+         [store_hd,
+          ng_id,
+          ng_term,
+          ckpt_ts,
+          has_data_store_write,
+          expected_standby_term]()
          {
+             const int64_t current_standby_term =
+                 Sharder::Instance().StandbyNodeTerm();
+             if (current_standby_term != expected_standby_term ||
+                 current_standby_term < 0 ||
+                 PrimaryTermFromStandbyTerm(current_standby_term) != ng_term)
+             {
+                 DLOG(INFO) << "Skip stale UpdateStandbyCkptTs task at execution, "
+                            << "ng_id=" << ng_id
+                            << ", expected_standby_term=" << expected_standby_term
+                            << ", current_standby_term=" << current_standby_term
+                            << ", ng_term=" << ng_term
+                            << ", ckpt_ts=" << ckpt_ts;
+                 return;
+             }
              const bool ok =
                  store_hd == nullptr
                      ? true
                      : store_hd->OnUpdateStandbyCkptTs(
                            ng_id, ng_term, ckpt_ts, !has_data_store_write);
              if (ok)
              {
                  Sharder::Instance().UpdateNodeGroupCkptTs(ng_id, ckpt_ts);
              }
          }});
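
The same check-at-enqueue, recheck-at-execution pattern can be exercised outside the service. The sketch below is a simplified model under stated assumptions: `StandbyState`, `TermMatches`, `EnqueueCkptUpdate`, and `RunAll` are hypothetical names, a plain `std::queue` stands in for the standby task queue, and the only detail taken from the diff is that the upper 32 bits of the standby term are compared with the node group term.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical model of the standby node: -1 means "not a standby".
struct StandbyState
{
    std::atomic<int64_t> standby_term{-1};
    uint64_t applied_ckpt_ts = 0;
};

// Mirrors the guard in the diff: upper 32 bits must equal the node group term.
inline bool TermMatches(int64_t standby_term, int64_t ng_term)
{
    return standby_term >= 0 && (standby_term >> 32) == ng_term;
}

// Validates at enqueue time and again inside the task when it finally runs.
inline bool EnqueueCkptUpdate(StandbyState &state,
                              std::queue<std::function<void()>> &tasks,
                              int64_t ng_term,
                              uint64_t ckpt_ts)
{
    const int64_t expected = state.standby_term.load();
    if (!TermMatches(expected, ng_term))
    {
        return false;  // rejected before enqueue
    }
    tasks.push(
        [&state, expected, ng_term, ckpt_ts]()
        {
            const int64_t current = state.standby_term.load();
            if (current != expected || !TermMatches(current, ng_term))
            {
                return;  // term changed while queued: drop the stale task
            }
            state.applied_ckpt_ts = ckpt_ts;
        });
    return true;
}

// Drains the queue on the calling thread.
inline void RunAll(std::queue<std::function<void()>> &tasks)
{
    while (!tasks.empty())
    {
        std::function<void()> task = std::move(tasks.front());
        tasks.pop();
        task();
    }
}
```

Capturing the term observed at enqueue time and comparing it again inside the task is what closes the window between validation and execution; the enqueue-time check alone only filters requests that are already stale on arrival.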
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tx_service/src/remote/cc_node_service.cpp` around lines 1852 - 1895, The
async task enqueued by EnqueueStandbyTask can run after StandbyNodeTerm() has
changed, so revalidate the term inside the lambda and abort the task if it no
longer matches ng_term: inside the lambda capture ng_term (already captured) and
call Sharder::Instance().StandbyNodeTerm(), then check the same condition used
before (term == -1 || (term >> 32) != ng_term) and return early if it fails;
only call store_hd->OnUpdateStandbyCkptTs and
Sharder::Instance().UpdateNodeGroupCkptTs when the in-task term check passes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ec533c31-602f-4ac4-9d55-a244a617ed0a

📥 Commits

Reviewing files that changed from the base of the PR and between 2b74669 and 175baa6.

📒 Files selected for processing (1)
  • tx_service/src/remote/cc_node_service.cpp
